BUG: Fix extra decimal places in DataFrame.to_csv() with quoting=csv.QUOTE_NONNUMERIC and float16/float32 dtypes (#60699) #60804

akj2018 · 2025-01-28T01:32:35Z

- closes BUG: quoting=csv.QUOTE_NONNUMERIC adds extra decimal places #60699
- Tests added and passed
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

Resolved by converting floats to strings to preserve decimal representation.
Removed unnecessary quoting=None logic for float arrays.
Added tests for float16, float32, and float64 cases with mixed values.

Issue

DataFrame.to_csv() writes extra decimal places when quoting=csv.QUOTE_NONNUMERIC, the DataFrame's dtype is float16 or float32, and float_format=None.
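
A minimal reproduction sketch, assuming a pandas version affected by #60699. The column name and values are illustrative, and the exact digits printed may differ between versions, so no output is shown:

```python
import csv

import numpy as np
import pandas as pd

# float32 cannot store 8.57 exactly; the nearest float32 value is stored instead.
df = pd.DataFrame({"col": [8.57, 0.156, np.nan]}, dtype="float32")

# Default quoting: the floats are rendered from their string form.
plain = df.to_csv(index=False)

# QUOTE_NONNUMERIC: on affected versions the floats pass through an object
# array first, so the widened 64-bit digits may show up in the output.
quoted = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)

print(plain)
print(quoted)
```

Comparing the two printed blocks side by side is usually enough to see whether a given installation is affected.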

Reason

DataFrame.to_csv() internally calls get_values_for_csv(). When quoting is set (here csv.QUOTE_NONNUMERIC), that function converts the NumPy float array to an object array.

pandas/pandas/core/indexes/base.py

Lines 7751 to 7765 in 57d2489

    
    elif values.dtype.kind == "f" and not isinstance(values.dtype, SparseDtype):
        # see GH#13418: no special formatting is desired at the
        # output (important for appropriate 'quoting' behaviour),
        # so do not pass it through the FloatArrayFormatter
        if float_format is None and decimal == ".":
            mask = isna(values)

            if not quoting:
                values = values.astype(str)
            else:
                values = np.array(values, dtype="object")

            values[mask] = na_rep
            values = values.astype(object, copy=False)
            return values

`np.array(values, dtype="object")` affects `float16`, `float32` and `float64` differently

For float16, float32
- Both have limited precision, so most decimal values are stored as the nearest representable approximation (8.57 is stored in float16 as 8.5703125).
- On conversion to an object array, each value becomes a Python float (a 64-bit double, equivalent to numpy.float64). The stored value is widened exactly, and Python then prints enough digits to identify that double uniquely.
- Therefore, extra decimal places appear in the output when a float16 or float32 array is converted to dtype=object.

    import numpy as np

    arr = np.array([8.57, 0.156, -0.312, 123.3, -54.5, np.nan], dtype=np.float16)
    print(arr)
    # [  8.57    0.156  -0.312 123.3   -54.5       nan]

    arr_obj = arr.astype(object)
    print(arr_obj)
    # [8.5703125 0.156005859375 -0.31201171875 123.3125 -54.5 nan]
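
The same effect can be checked for float32. The following is a small sketch rather than part of the PR; the variable names are illustrative:

```python
import numpy as np

arr32 = np.array([8.57, 0.156], dtype=np.float32)
arr32_obj = arr32.astype(object)

# Each element of the object array is a plain Python float (a 64-bit double).
print(type(arr32_obj[0]))

# The stored value is unchanged by widening; only the printed digits differ.
print(str(arr32[0]), arr32_obj[0])
widened_equal = np.float32(arr32_obj[0]) == arr32[0]

# float16 stores 8.57 as exactly 8.5703125, which a double can hold exactly.
f16_as_float = float(np.float16(8.57))
print(f16_as_float)
```

So the conversion to object does not change the number itself. It only changes which type does the printing.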

For float64
- float64 carries 53 bits of significand precision. It still cannot store most decimal values (such as 8.57) exactly, but a Python float is the same 64-bit double, so no widening takes place.
- When a float64 NumPy array is converted to object, each value is carried over unchanged, and Python's shortest round-trip repr prints it as 8.57, so there are no "extra decimals" in the output.

    import numpy as np

    arr = np.array([8.57, 0.156, -0.312, 123.3, -54.5, np.nan], dtype=np.float64)
    print(arr)
    # [  8.57    0.156  -0.312 123.3   -54.5       nan]

    arr_obj = arr.astype(object)
    print(arr_obj)
    # [8.57 0.156 -0.312 123.3 -54.5 nan]
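
A short standard-library sketch of why a 64-bit value prints as 8.57: the stored binary value is not exactly 8.57, but repr chooses the shortest decimal string that parses back to the same double.

```python
from decimal import Decimal

x = 8.57  # a Python float, i.e. a 64-bit double

# Decimal(x) expands the exact binary value stored for x.
exact = Decimal(x)
is_exact = exact == Decimal("8.57")

# repr picks the shortest decimal string that round-trips to the same double.
shortest = repr(x)
round_trips = float(shortest) == x

print(exact)
print(is_exact, shortest, round_trips)
```

Here `is_exact` is False while `shortest` is "8.57", which is why float64 columns show no extra digits.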

Fix Implemented

To preserve the decimal representation for dtype=float16 and float32, we convert the NumPy float array to strings and then parse those strings as 64-bit floats (numpy.float64, equivalent to Python's float).

Conversion to str preserves the decimal representation and avoids exposing the widened binary value.
Conversion back to float is needed so the values are not treated (and quoted) as strings; parsing each string into a 64-bit double keeps the same printed representation.

Additionally, in the original code, when quoting is None, converting first to str and then to object is unnecessary work, because na_rep (a str) can be assigned directly into an object array.

Therefore, the quoting=None branch was removed to streamline the logic.

    elif values.dtype.kind == "f" and not isinstance(values.dtype, SparseDtype):
        # see GH#13418: no special formatting is desired at the
        # output (important for appropriate 'quoting' behaviour),
        # so do not pass it through the FloatArrayFormatter
        if float_format is None and decimal == ".":
            mask = isna(values)

            if values.dtype in [np.float16, np.float32]:
                values = np.array(values, dtype="str")  # preserve decimal representation
                values = values.astype(float, copy=False)  # preserve string representation

            values = values.astype(object, copy=False)
            values[mask] = na_rep
            return values
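
The string round trip can be tried in isolation with NumPy alone. This is a sketch under assumptions: `widen_via_str` is a hypothetical helper written for illustration and is not a pandas function.

```python
import numpy as np


def widen_via_str(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: widen float16/float32 to float64 via strings."""
    if values.dtype in (np.float16, np.float32):
        # The string form keeps the short decimal text, e.g. "8.57".
        as_text = np.array(values, dtype="str")
        # Parsing the text yields the double nearest to that decimal text.
        return as_text.astype(np.float64)
    return values.astype(np.float64, copy=False)


arr16 = np.array([8.57, 0.156, -54.5, np.nan], dtype=np.float16)

direct = arr16.astype(np.float64)  # exact widening of the stored values
via_str = widen_via_str(arr16)     # widening through the decimal text

print(direct.tolist())
print(via_str.tolist())
```

One trade-off to keep in mind: the values produced through strings print cleanly, but as 64-bit numbers they are no longer identical to the exactly widened originals (8.57 versus 8.5703125 for the first element).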

Testing

All existing tests in test_to_csv.py pass. New tests cover DataFrames with dtype float16, float32 and float64, each with a mix of negative, positive and missing values and quoting=csv.QUOTE_NONNUMERIC:

1. {"col": [8.57, 0.156, -0.312, 123.3, -54.5, np.nan]} and dtype="float16"

2. {"col": [8.57, 1.234567, -2.345678, 1e6, -1.5e6, np.nan]} and dtype="float32"

3. {"col": [8.57, 3.141592653589793, -2.718281828459045, 1.01e12, -5.67e11, np.nan]} and dtype="float64"


github-actions · 2025-02-28T00:07:20Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2025-03-03T18:22:19Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

akj2018 added 6 commits January 14, 2025 14:11

- 1fdae29: Add samply.py for example testing, modify get_values_for_csv to fix extra decimal points
- dc86902: Modified base.py get_values_for_csv() to prevent extra decimal places for float16, float32 in output
- c6aea1e: Add tests to check to_csv for dtypes - float16, float32, float64 and quoting option enabled
- d02a7a2: Add entry in whatsnew for bugfix
- 9251283: Fix ruff fomatting, linting, namespace issues using pre-commit hooks
- 91fc354: Merge remote-tracking branch 'upstream/main' into bugfix/60699-tocsv-quoting

github-actions bot added the Stale label Feb 28, 2025

mroeschke closed this Mar 3, 2025
